HEAD ======= >>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
The downloaded binary packages are in
/var/folders/9s/0zkjn8tm8xj7wp0059bl3v9000020t/T//RtmpiUigeM/downloaded_packages
>>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
| libraries |
|---|
| knitr |
| FactoMineR |
| factoextra |
| pheatmap |
| gprofiler2 |
| pheatmap |
| biomaRt |
R version 4.0.2 (2020-06-22)
<<<<<<< HEAD
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /shared/ifbstor1/software/miniconda/envs/r-4.0.2/lib/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
=======
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
<<<<<<< HEAD
[1] biomaRt_2.46.3 pheatmap_1.0.12 factoextra_1.0.7 ggplot2_3.3.3 FactoMineR_2.3 knitr_1.30
loaded via a namespace (and not attached):
[1] ggrepel_0.9.1 Rcpp_1.0.6 lattice_0.20-41 prettyunits_1.1.1 assertthat_0.2.1 digest_0.6.27 utf8_1.2.1 BiocFileCache_1.14.0 R6_2.5.0 stats4_4.0.2 RSQLite_2.2.3 evaluate_0.14 highr_0.8 httr_1.4.2
[15] pillar_1.5.1 rlang_0.4.10 progress_1.2.2 curl_4.3 blob_1.2.1 S4Vectors_0.28.1 rmarkdown_2.5 stringr_1.4.0 bit_4.0.4 munsell_0.5.0 compiler_4.0.2 xfun_0.20 askpass_1.1 pkgconfig_2.0.3
[29] BiocGenerics_0.36.0 htmltools_0.5.1.1 openssl_1.4.3 flashClust_1.01-2 tidyselect_1.1.0 tibble_3.1.0 IRanges_2.24.1 XML_3.99-0.5 fansi_0.4.2 dbplyr_2.0.0 crayon_1.4.1 dplyr_1.0.5 withr_2.4.1 rappdirs_0.3.3
[43] MASS_7.3-53 leaps_3.1 grid_4.0.2 gtable_0.3.0 lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1 stringi_1.5.3 debugme_1.1.0 cachem_1.0.4 scatterplot3d_0.3-41 xml2_1.3.2 ellipsis_0.3.1
[57] generics_0.1.0 vctrs_0.3.6 RColorBrewer_1.1-2 tools_4.0.2 bit64_4.0.5 Biobase_2.50.0 glue_1.4.2 purrr_0.3.4 hms_1.0.0 parallel_4.0.2 fastmap_1.1.0 yaml_2.2.1 AnnotationDbi_1.52.0 colorspace_2.0-0
[71] cluster_2.1.0 BiocManager_1.30.10 memoise_2.0.0
=======
[1] biomaRt_2.44.4 pheatmap_1.0.12 gprofiler2_0.2.0 factoextra_1.0.7 ggplot2_3.3.3 FactoMineR_2.4 knitr_1.32
loaded via a namespace (and not attached):
[1] Biobase_2.48.0 httr_1.4.2 sass_0.3.1 tidyr_1.1.3 bit64_4.0.5 jsonlite_1.7.2 viridisLite_0.4.0 bslib_0.2.4 assertthat_0.2.1 askpass_1.1 highr_0.9 BiocFileCache_1.12.1 BiocManager_1.30.12 stats4_4.0.2
[15] blob_1.2.1 yaml_2.2.1 progress_1.2.2 ggrepel_0.9.1 pillar_1.6.0 RSQLite_2.2.6 lattice_0.20-41 glue_1.4.2 digest_0.6.27 RColorBrewer_1.1-2 colorspace_2.0-0 htmltools_0.5.1.1 XML_3.99-0.6 pkgconfig_2.0.3
[29] purrr_0.3.4 scales_1.1.1 tibble_3.1.0 openssl_1.4.3 generics_0.1.0 IRanges_2.22.2 ellipsis_0.3.1 DT_0.18 cachem_1.0.4 withr_2.4.1 BiocGenerics_0.34.0 lazyeval_0.2.2 magrittr_2.0.1 crayon_1.4.1
[43] memoise_2.0.0 evaluate_0.14 fansi_0.4.2 MASS_7.3-53.1 xml2_1.3.2 tools_4.0.2 data.table_1.14.0 prettyunits_1.1.1 hms_1.0.0 lifecycle_1.0.0 stringr_1.4.0 plotly_4.9.3 S4Vectors_0.26.1 munsell_0.5.0
[57] cluster_2.1.1 AnnotationDbi_1.50.3 flashClust_1.01-2 compiler_4.0.2 jquerylib_0.1.3 rlang_0.4.10 grid_4.0.2 rappdirs_0.3.3 htmlwidgets_1.5.3 leaps_3.1 rmarkdown_2.7 gtable_0.3.0 curl_4.3 DBI_1.1.1
[71] R6_2.5.0 dplyr_1.0.5 fastmap_1.1.0 bit_4.0.4 utf8_1.2.1 stringi_1.5.3 parallel_4.0.2 Rcpp_1.0.6 vctrs_0.3.7 dbplyr_2.1.1 scatterplot3d_0.3-41 tidyselect_1.1.0 xfun_0.22
>>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
Le but de ce travail est de mettre en oeuvre les méthodes vues dans le module 3 “R et statistiques” pour explorer le jeu de données de Pavokovic, et de rendre un rapport d’analyse au format .Rmd.
Nous fournissons ci-dessous une trame avec les principales sections attendues. Certaines contiennent déjà du code. Vous devrez en compléter d’autres. Sentez-vous libres d’adapter cette trame ou d’y ajouter des analyses complémentaires si elles vous aident à interpréter vos résultats.
Date: le 10 mai 2021 minuit. Si vous anticipez un problème pour remettre le rapport à cette date contactez-nous aussi rapidement que possible pour que nous puissions prévoir une remise plus tardive.
Commencez par renommer le fichier .Rmd en remplaçant Prenom-NOM par vos nom et prénom.
Le rapport est attendu en formats .Rmd + .HTML (en gardant l’option self_contained de l’en-tête activée).
Déposez les fichiers dans un sous-dossier de vote compte du cluster. Attention, veillez à respecter précisément cette structure de chemin car nous nous baserons dessus pour récupérer vos résultats.
/shared/projects/dubii2021/[login]/m3-stat-R/mini-projet
Nous vous encourageons à assurer la lisibilité de votre code (syntaxe, nommage des variables, commentaires de code)
Nous partons du même jeu de données Fil Rouge de ce module issues de la publication Pavkovic, M., Pantano, L., Gerlach, C.V. et al. Multi omics analysis of fibrotic kidneys in two mouse models. Sci Data 6, 92 (2019). https://doi.org/10.1038/s41597-019-0095-5
Rappel sur les échantillons:
Deux modèles de fibrose rénale chez la souris sont étudiés:
Le premier est un modèle de néphropathie réversible induite par l’acide folique (folic acid (FA)). Les souris ont été sacrifiées avant le traitement (normal), puis à jour 1, 2, 7 et 14 (day1,…) après une seule injection d’acide folique.
Le second est un modèle irréversible induit chrirurgicalement (unilateral ureteral obstruction (UUO)). les souris ont été sacrifiées avant obstruction (day 0) et à 3, 7 et 14 jours après obstruction par ligation de l’uretère du rein gauche.
A partir de ces extraits de rein, l’ARN messager total et les petits ARNs ont été séquencés et les protéines caratérisées par spectrométrie de masse en tandem (TMT).
But scientifique: Dans le tutoriel sur les dataframes, vous avez travaillé sur les données de transcriptome du modèle UUO. Dans ce mini-projet, vous allez travailler sur les données du transcriptome du modèle FA afin de regrouper les observations (échantillon) et les gènes selon des profils d’expression similaires.
Votre projet se décompose en 4 parties:
Vous n’avez rien à coder ici. Le code est fourni.
Le bloc suivant contient une fonction qui permet de télécharger un fichier dans l’espace de travail, sauf s’il est déjà présent. Nous l’utiliserons ensuite pour télécharger les données à analyser en évitant de refaire le transfert à chaque exécution de l’analyse.
<<<<<<< HEAD#' @title Download a file only if it is not yet here
#' @author Jacques van Helden email{Jacques.van-Helden@@france-bioinformatique.fr}
#' @param url_base base of the URL, that will be prepended to the file name
#' @param file_name name of the file (should not contain any path)
#' @param local_folder path of a local folder where the file should be stored
#' @return the function returns the path of the local file, built from local_folder and file_name
#' @export©
download_only_once <- function(
url_base,
file_name,
local_folder) {
## Define the source URL
url <- file.path(url_base, file_name)
message("Source URL\n\t", url)
## Define the local file
local_file <- file.path(local_folder, file_name)
## Create the local data folder if it does not exist
dir.create(local_folder, showWarnings = FALSE, recursive = TRUE)
## Download the file ONLY if it is not already there
if (!file.exists(local_file)) {
message("Downloading file from source URL to local file\n\t",
local_file)
download.file(url = url, destfile = local_file)
} else {
message("Local file already exists, no need to download\n\t",
local_file)
}
return(local_file)
}#' @title Download a file only if it is not yet here
#' @author Jacques van Helden email{Jacques.van-Helden@@france-bioinformatique.fr}
#' @param url_base base of the URL, that will be prepended to the file name
#' @param file_name name of the file (should not contain any path)
#' @param local_folder path of a local folder where the file should be stored
#' @return the function returns the path of the local file, built from local_folder and file_name
#' @export©
download_only_once <- function(
url_base,
file_name,
local_folder) {
## Define the source URL
url <- file.path(url_base, file_name)
message("Source URL\n\t", url)
## Define the local file
local_file <- file.path(local_folder, file_name)
## Create the local data folder if it does not exist
dir.create(local_folder, showWarnings = FALSE, recursive = TRUE)
## Download the file ONLY if it is not already there
if (!file.exists(local_file)) {
message("Downloading file from source URL to local file\n\t",
local_file)
download.file(url = url, destfile = local_file)
} else {
message("Local file already exists, no need to download\n\t",
local_file)
}
return(local_file)
}Nous téléchargeons deux fichiers dans un dossier local ~/m3-stat-R/pavkovic_analysis (vous pouvez changer le nom ou chemin dans le chunk ci-dessous), et les chargeons dans les data.frames suivants:
fa_expr_rawfa_meta## Define the remote URL and local folder
pavkovic_url <- "https://github.com/DU-Bii/module-3-Stat-R/raw/master/stat-R_2021/data/pavkovic_2019/"
## Define the local folder for this analysis (where the data will be downloaded and the results generated)
pavkovic_folder <- "~/m3-stat-R/pavkovic_analysis"
## Define a sub-folder for the data
pavkovic_data_folder <- file.path(pavkovic_folder, "data")
## Download and load the expression data table
## Note: we use check.names=FALSE to avoid replacing hyphens by dots
## in sample names, because we want to keep them as in the
## original data files.
message("Downloading FA transcriptome file\t", "fa_raw_counts.tsv.gz",
"\n\tfrom\t", pavkovic_url)
fa_expr_file <- download_only_once(
url_base = pavkovic_url,
file_name = "fa_raw_counts.tsv.gz",
local_folder = pavkovic_data_folder)
## Load the expresdsion table
message("Loading FA transcriptome data from\n\t", fa_expr_file)
fa_expr_raw <- read.delim(file = fa_expr_file,
header = TRUE,
row.names = 1)
## Download the metadata file
message("Downloading FA metadata file\t", "fa_transcriptome_metadata.tsv",
"\n\tfrom\t", pavkovic_url)
fa_meta_file <- download_only_once(
url_base = pavkovic_url,
file_name = "fa_transcriptome_metadata.tsv",
local_folder = pavkovic_data_folder)
## Load the metadata
message("Loading FA metadata from\n\t", fa_meta_file)
fa_meta <- read.delim(file = fa_meta_file,
header = TRUE,
row.names = 1)Nous regardons la structure de chaque dataframe.
str(fa_expr_raw)'data.frame': 46679 obs. of 18 variables:
=======
## Define the remote URL and local folder
pavkovic_url <- "https://github.com/DU-Bii/module-3-Stat-R/raw/master/stat-R_2021/data/pavkovic_2019/"
## Define the local folder for this analysis (where the data will be downloaded and the results generated)
pavkovic_folder <- "~/m3-stat-R/pavkovic_analysis"
## Define a sub-folder for the data
pavkovic_data_folder <- file.path(pavkovic_folder, "data")
## Download and load the expression data table
## Note: we use check.names=FALSE to avoid replacing hyphens by dots
## in sample names, because we want to keep them as in the
## original data files.
message("Downloading FA transcriptome file\t", "fa_raw_counts.tsv.gz",
"\n\tfrom\t", pavkovic_url)
fa_expr_file <- download_only_once(
url_base = pavkovic_url,
file_name = "fa_raw_counts.tsv.gz",
local_folder = pavkovic_data_folder)
## Load the expresdsion table
message("Loading FA transcriptome data from\n\t", fa_expr_file)
fa_expr_raw <- read.delim(file = fa_expr_file,
header = TRUE,
row.names = 1)
## Download the metadata file
message("Downloading FA metadata file\t", "fa_transcriptome_metadata.tsv",
"\n\tfrom\t", pavkovic_url)
fa_meta_file <- download_only_once(
url_base = pavkovic_url,
file_name = "fa_transcriptome_metadata.tsv",
local_folder = pavkovic_data_folder)
## Load the metadata
message("Loading FA metadata from\n\t", fa_meta_file)
fa_meta <- read.delim(file = fa_meta_file,
header = TRUE,
row.names = 1)
Nous regardons la structure de chaque dataframe.
'data.frame': 46679 obs. of 18 variables:
>>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
$ day1_1 : num 2278.8 0 36.3 13.2 0 ...
$ day1_2 : num 1786.5 0 22.15 7.15 27.9 ...
$ day1_3 : num 2368.62 0 39.48 1.12 6.9 ...
$ day14_1 : num 627.758 0 14.471 0.867 5.692 ...
$ day14_2 : num 559.2 0 10.2 0 1.9 ...
$ day14_3 : num 611.434 0 31.691 0 0.655 ...
$ day2_1 : num 2145.22 0 300.56 1.71 57.38 ...
$ day2_2 : num 262.45 0 4.77 0 0 ...
$ day2_3 : num 745.84 0 123.9 5.26 38.9 ...
$ day3_1 : num 987.185 0 51.856 0.802 8.931 ...
$ day3_2 : num 1077.65 0 8.43 0 6.97 ...
$ day3_3 : num 1335.1 0 69.9 0 0 ...
$ day7_1 : num 1096.08 0 6.67 0 7.94 ...
$ day7_2 : num 1035.846 0 6.955 0.849 101.648 ...
$ day7_3 : num 1090.04 0 42.58 1.71 0.65 ...
$ normal_1: num 483.23 0 7.35 0.86 32.06 ...
$ normal_2: num 1842.1 0 11.2 0 10.4 ...
$ normal_3: num 475.7 0 1.03 0 0 ...
<<<<<<< HEAD
str(fa_meta)
'data.frame': 18 obs. of 5 variables:
=======
'data.frame': 18 obs. of 5 variables:
>>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
$ dataType : chr "transcriptome" "transcriptome" "transcriptome" "transcriptome" ...
$ sampleName : chr "day14_1" "day14_2" "day14_3" "day1_1" ...
$ condition : chr "day14" "day14" "day14" "day1" ...
$ sampleNumber: int 1 2 3 1 2 3 1 2 3 1 ...
$ color : chr "#FF4400" "#FF4400" "#FF4400" "#BBD7FF" ...
Les deux fichiers ne donnent pas les observations de l’échantillon dans le même ordre:
<<<<<<< HEAD
fa_meta$sampleName == names(fa_expr_raw)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Nous les réorganisons les échantillons dans l’ordre de l’expérience: condition normale, puis day 1 à 14 avec les 3 réplicats.
sample_order <- c(paste(rep(c("normal", "day1", "day2", "day3", "day7", "day14"), each = 3),
1:3, sep = "_"))
fa_expr_raw <- fa_expr_raw[,sample_order]
fa_meta <- fa_meta[match(sample_order, fa_meta$sampleName),]
# View(fa_meta)
kable(fa_meta, caption = "Metdata for Pavkovoc FA transcriptome")
=======
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Nous les réorganisons les échantillons dans l’ordre de l’expérience: condition normale, puis day 1 à 14 avec les 3 réplicats.
sample_order <- c(paste(rep(c("normal", "day1", "day2", "day3", "day7", "day14"), each = 3),
1:3, sep = "_"))
fa_expr_raw <- fa_expr_raw[,sample_order]
fa_meta <- fa_meta[match(sample_order, fa_meta$sampleName),]
# View(fa_meta)
kable(fa_meta, caption = "Metdata for Pavkovoc FA transcriptome")
>>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
Metdata for Pavkovoc FA transcriptome
dataType
sampleName
condition
sampleNumber
color
16
transcriptome
normal_1
normal
1
#BBFFBB
17
transcriptome
normal_2
normal
2
#BBFFBB
18
transcriptome
normal_3
normal
3
#BBFFBB
4
transcriptome
day1_1
day1
1
#BBD7FF
5
transcriptome
day1_2
day1
2
#BBD7FF
6
transcriptome
day1_3
day1
3
#BBD7FF
7
transcriptome
day2_1
day2
1
#F0BBFF
8
transcriptome
day2_2
day2
2
#F0BBFF
9
transcriptome
day2_3
day2
3
#F0BBFF
10
transcriptome
day3_1
day3
1
#FFFFDD
11
transcriptome
day3_2
day3
2
#FFFFDD
12
transcriptome
day3_3
day3
3
#FFFFDD
13
transcriptome
day7_1
day7
1
#FFDD88
14
transcriptome
day7_2
day7
2
#FFDD88
15
transcriptome
day7_3
day7
3
#FFDD88
1
transcriptome
day14_1
day14
1
#FF4400
2
transcriptome
day14_2
day14
2
#FF4400
3
transcriptome
day14_3
day14
3
#FF4400
=> Ainsi, nous avons un jeu de données avec un échantillon de 18 observations et des données d’expression de 46679 gènes.
Appliquez une transformation log2 des données brutes, après avoir ajouté un epsilon \(\epsilon = 1\) (les valeurs nulles seront donc représentées par un log2(counts) valant \(0\). Stockez le résultat dans un data.frame nommé fa_expr_log2.
Affchez un fragment des tableaux fa_expr_raw et fa_expr_log2 en sélectionnant les lignes 100 à 109 et les colonnes 5 à 10, afin de vous assurer que la transformation log2 a bien fonctionné.
## Log2 transformation of the transcriptome data
epsilon <- 1
fa_expr_log2 <- log2(fa_expr_raw + epsilon)
# dim(fa_expr_log2)
# View(head(fa_expr_log2))
## Display of a fragment of the data before and after log2 transformation
kable(fa_expr_raw[100:109, 5:10], caption = "Fragment des données transcriptomiques brutes")## Log2 transformation of the transcriptome data
epsilon <- 1
fa_expr_log2 <- log2(fa_expr_raw + epsilon)
# dim(fa_expr_log2)
# View(head(fa_expr_log2))
## Display of a fragment of the data before and after log2 transformation
kable(fa_expr_raw[100:109, 5:10], caption = "Fragment des données transcriptomiques brutes")| day1_2 | day1_3 | day2_1 | day2_2 | day2_3 | day3_1 | |
|---|---|---|---|---|---|---|
| ENSMUSG00000000567 | 599.648075 | 304.093177 | 1052.689447 | 106.995584 | 347.13842 | 479.59911 |
| ENSMUSG00000000568 | 1008.026306 | 1349.126716 | 818.257157 | 116.136417 | 3406.45030 | 766.09722 |
| ENSMUSG00000000579 | 599.832586 | 500.735597 | 473.399472 | 36.258005 | 410.57787 | 347.05054 |
| ENSMUSG00000000581 | 942.511039 | 744.646735 | 546.260344 | 87.788535 | 319.12679 | 461.45732 |
| ENSMUSG00000000594 | 3063.845705 | 2743.404374 | 2283.051270 | 1115.588250 | 1491.61384 | 1576.68451 |
| ENSMUSG00000000600 | 3341.854530 | 3561.805180 | 3674.188108 | 589.840068 | 1399.10751 | 3446.01856 |
| ENSMUSG00000000605 | 791.312561 | 558.710943 | 489.579657 | 77.008213 | 256.00200 | 282.80077 |
| ENSMUSG00000000606 | 4.785252 | 6.566374 | 3.970399 | 0.000000 | 138.48677 | 0.00000 |
| ENSMUSG00000000617 | 15.839486 | 29.138902 | 30.228506 | 6.670500 | 18.27493 | 13.41152 |
| ENSMUSG00000000627 | 36.673777 | 35.967747 | 46.297517 | 2.878275 | 17.03638 | 15.45854 |
kable(fa_expr_log2[100:109, 5:10], caption = "Fragment des données transcriptomiques après transformation log2")| day1_2 | day1_3 | day2_1 | day2_2 | day2_3 | day3_1 | |
|---|---|---|---|---|---|---|
| ENSMUSG00000000567 | 9.230376 | 8.253106 | 10.041234 | 6.754829 | 8.443517 | 8.908690 |
| ENSMUSG00000000568 | 9.978748 | 10.398879 | 9.678173 | 6.872046 | 11.734477 | 9.583266 |
| ENSMUSG00000000579 | 9.230819 | 8.970783 | 8.889959 | 5.219479 | 8.685022 | 8.443153 |
| ENSMUSG00000000581 | 9.881896 | 9.542348 | 9.096084 | 6.472302 | 8.322500 | 8.853176 |
| ENSMUSG00000000594 | 11.581599 | 11.422277 | 11.157379 | 10.124882 | 10.543625 | 10.623593 |
| ENSMUSG00000000600 | 11.706865 | 11.798798 | 11.843602 | 9.206624 | 10.451322 | 11.751133 |
| ENSMUSG00000000605 | 9.629926 | 9.128538 | 8.938344 | 6.285554 | 8.005636 | 8.148735 |
| ENSMUSG00000000606 | 2.532380 | 2.919602 | 2.313362 | 0.000000 | 7.123984 | 0.000000 |
| ENSMUSG00000000617 | 4.073776 | 4.913555 | 4.964792 | 2.939321 | 4.268654 | 3.849151 |
| ENSMUSG00000000627 | 5.235489 | 5.208195 | 5.563693 | 1.955415 | 4.172838 | 4.040765 |
Dans le tutorial sur les dataframes sur le jeu de données “uuo” (relisez le corrigé), nous vous avons demandé de créer un data.frame qui collectera les statistiques par gène et par échantillon. Nous vous demandons de réaliser une étude similaire sur les données “FA”.
Nous créons un data.frame nommé sample_stat_prenorm qui comportera une ligne par échantillon et une colonne par statistique. Nous calculons les statistiques suivantes sur les valeurs log2 d’expression de chaque échantillon:
Il sera affiché avec la fonction kable() (n’oubliez pas la légende).
sample_stat_prenorm <- data.frame(
mean = apply(fa_expr_log2, 2, mean, na.rm=TRUE),
sd = apply(fa_expr_log2, 2, sd, na.rm=TRUE),
iqr = apply(fa_expr_log2, 2, IQR, na.rm=TRUE),
Q1 = apply(fa_expr_log2, 2, quantile, p = 0.25, na.rm=TRUE),
median = apply(fa_expr_log2, 2, median, na.rm=TRUE),
Q3 = apply(fa_expr_log2, 2, quantile, p = 0.75, na.rm=TRUE),
max = apply(fa_expr_log2, 2, max, na.rm=TRUE),
null = apply(fa_expr_log2 == 0, 2, sum, na.rm=TRUE)
)
kable(sample_stat_prenorm, caption = "Sample-wise statistics before normalisation.")message("Computing sample-wise statistics on raw counts")
sample_stat_prenorm <- data.frame(
mean = apply(fa_expr_log2, 2, mean, na.rm=TRUE),
sd = apply(fa_expr_log2, 2, sd, na.rm=TRUE),
iqr = apply(fa_expr_log2, 2, IQR, na.rm=TRUE),
Q1 = apply(fa_expr_log2, 2, quantile, p = 0.25, na.rm=TRUE),
median = apply(fa_expr_log2, 2, median, na.rm=TRUE),
Q3 = apply(fa_expr_log2, 2, quantile, p = 0.75, na.rm=TRUE),
max = apply(fa_expr_log2, 2, max, na.rm=TRUE),
null = apply(fa_expr_log2 == 0, 2, sum, na.rm=TRUE)
)
kable(sample_stat_prenorm, caption = "Sample-wise statistics before normalisation.")| mean | sd | iqr | Q1 | median | Q3 | max | null | |
|---|---|---|---|---|---|---|---|---|
| normal_1 | 2.876096 | 3.382687 | 5.566333 | 0 | 1.115495 | 5.566333 | 23.48952 | 21415 |
| normal_2 | 3.475984 | 3.851531 | 6.702346 | 0 | 1.991260 | 6.702346 | 24.43355 | 20203 |
| normal_3 | 2.796962 | 3.301372 | 5.419783 | 0 | 1.057748 | 5.419783 | 23.47218 | 21660 |
| day1_1 | 3.003791 | 3.526143 | 5.901203 | 0 | 1.074190 | 5.901203 | 23.65130 | 21700 |
| day1_2 | 3.516314 | 3.877108 | 6.816493 | 0 | 2.019190 | 6.816493 | 24.58225 | 20152 |
| day1_3 | 3.607632 | 3.917601 | 6.946987 | 0 | 2.252981 | 6.946987 | 24.70005 | 19621 |
| day2_1 | 3.608242 | 3.944769 | 7.000535 | 0 | 2.170400 | 7.000535 | 24.32636 | 19978 |
| day2_2 | 2.088503 | 2.832614 | 3.958139 | 0 | 0.000000 | 3.958139 | 21.94520 | 24446 |
| day2_3 | 3.071767 | 3.573520 | 6.008117 | 0 | 1.309292 | 6.008117 | 23.47538 | 21455 |
| day3_1 | 3.233264 | 3.638371 | 6.300924 | 0 | 1.621087 | 6.300924 | 23.41316 | 20772 |
| day3_2 | 3.163046 | 3.595828 | 6.231985 | 0 | 1.556653 | 6.231985 | 23.03797 | 21142 |
| day3_3 | 3.440775 | 3.777905 | 6.727257 | 0 | 1.991235 | 6.727257 | 23.97460 | 20287 |
| day7_1 | 3.296409 | 3.667439 | 6.515860 | 0 | 1.703173 | 6.515860 | 23.34983 | 20865 |
| day7_2 | 3.251819 | 3.587848 | 6.350051 | 0 | 1.822801 | 6.350051 | 23.20695 | 20467 |
| day7_3 | 3.289691 | 3.617495 | 6.400883 | 0 | 1.954087 | 6.400883 | 23.53623 | 20045 |
| day14_1 | 3.102341 | 3.484017 | 6.044672 | 0 | 1.597272 | 6.044672 | 23.41418 | 20496 |
| day14_2 | 3.147517 | 3.557991 | 6.130466 | 0 | 1.600695 | 6.130466 | 23.87931 | 20870 |
| day14_3 | 3.192445 | 3.533419 | 6.250143 | 0 | 1.699228 | 6.250143 | 23.46460 | 20396 |
Nous créons ci-dessous un data.frame nommé gene_stat_prenorm qui comportera une ligne par gène et une colonne par statistique. Nous calculons les statistiques suivantes sur les valeurs log2 de chaque gène.
Ces résultats seront stockés dans un data.frame avec 1 ligne par échantillon et 1 colonne par statistique. Vous afficherez les lignes 100 à 109 de ce tableau de statistiques avec la fonction kable() (n’oubliez pas la légende).
## Gene-wise statistics before normalisation
gene_stat_prenorm <- data.frame(
mean = apply(fa_expr_raw, 1, mean, na.rm = TRUE),
sd = apply(fa_expr_raw, 1, sd, na.rm = TRUE),
iqr = apply(fa_expr_raw, 1, IQR, na.rm = TRUE),
Q1 = apply(fa_expr_raw, 1, quantile, p = 0.25, na.rm = TRUE),
median = apply(fa_expr_raw, 1, median, na.rm = TRUE),
Q3 = apply(fa_expr_raw, 1, quantile, p = 0.75, na.rm = TRUE),
max = apply(fa_expr_raw, 1, max, na.rm = TRUE),
null = apply(fa_expr_raw == 0, 1, sum, na.rm = TRUE)
)
kable(gene_stat_prenorm[100:109, ], caption = "Gene-wise statistics before normalisation")## Gene-wise statistics for the raw counts (will be used for normalisation)
message("Computing gene-wise statistics on raw counts")
gene_stat_prenorm <- data.frame(
mean = apply(fa_expr_raw, 1, mean, na.rm = TRUE),
sd = apply(fa_expr_raw, 1, sd, na.rm = TRUE),
iqr = apply(fa_expr_raw, 1, IQR, na.rm = TRUE),
Q1 = apply(fa_expr_raw, 1, quantile, p = 0.25, na.rm = TRUE),
median = apply(fa_expr_raw, 1, median, na.rm = TRUE),
Q3 = apply(fa_expr_raw, 1, quantile, p = 0.75, na.rm = TRUE),
max = apply(fa_expr_raw, 1, max, na.rm = TRUE),
null = apply(fa_expr_raw == 0, 1, sum, na.rm = TRUE)
)
kable(gene_stat_prenorm[100:109, ], caption = "Gene-wise statistics before normalisation")| mean | sd | iqr | Q1 | median | Q3 | max | null | |
|---|---|---|---|---|---|---|---|---|
| ENSMUSG00000000567 | 320.336744 | 286.399084 | 364.362298 | 91.046357 | 261.659973 | 455.40866 | 1052.68945 | 0 |
| ENSMUSG00000000568 | 857.348825 | 694.432189 | 282.978949 | 575.571199 | 711.096105 | 858.55015 | 3406.45030 | 0 |
| ENSMUSG00000000579 | 422.075702 | 162.290415 | 157.732606 | 340.417017 | 447.572742 | 498.14962 | 728.51737 | 0 |
| ENSMUSG00000000581 | 498.978227 | 248.612029 | 259.324338 | 320.599085 | 471.269224 | 579.92342 | 942.51104 | 0 |
| ENSMUSG00000000594 | 2113.558334 | 1237.521694 | 937.571105 | 1400.226182 | 1868.718762 | 2337.79729 | 6408.96044 | 0 |
| ENSMUSG00000000600 | 2299.978163 | 911.222891 | 1480.192582 | 1719.874980 | 2093.572929 | 3200.06756 | 3674.18811 | 0 |
| ENSMUSG00000000605 | 325.552118 | 169.456624 | 195.879561 | 221.735904 | 282.608205 | 417.61546 | 791.31256 | 0 |
| ENSMUSG00000000606 | 9.929258 | 32.151350 | 2.736779 | 1.178311 | 1.934744 | 3.91509 | 138.48677 | 4 |
| ENSMUSG00000000617 | 18.804116 | 9.470819 | 13.654701 | 12.188276 | 18.150391 | 25.84298 | 36.00235 | 0 |
| ENSMUSG00000000627 | 15.623306 | 14.341852 | 12.308127 | 4.780913 | 10.064895 | 17.08904 | 46.29752 | 0 |
Vous n’avez rien à coder ici. Le code est fourni.
Il existe plusieurs façons de normaliser les données de transcriptome vues dans les modules 4 et 5 (cf. total counts, quantiles, TMM, RLE, limma voom,…), mais nous avons choisi ici une solution simple tout en étant robuste pour normaliser les données en standardisant le 3ème quantile.
La méthode choisie ici consiste à :
écarter les gènes “non-détectés”, c’est-à-dire ceux ayant des valeurs nulles dans au moins 90% des échantillons;
écarter les gènes à peine exprimés, c’est-à-dire ceux ayant une valeur moyenne < 10 (arbitrairement);
standardiser les échantillons sur le 3ème quartile des valeurs non-nulles: on divise par le 3ème quartile de l’échantillon et on multiplie par le 3ème quartile déterminé sur l’ensemble des échantillons.
Nous fournissons ci-dessous le code.
<<<<<<< HEAD## Data filtering: genes having at least 90% null values
undetected_genes <- gene_stat_prenorm$null >= ncol(fa_expr_raw) * 0.9
print(paste0("Undetected genes (null in >= 90% samples): ", sum(undetected_genes)))[1] "Undetected genes (null in >= 90% samples): 14380"
## Data filtering: genes having a mean expression < 10
barely_expressed_genes <- gene_stat_prenorm$mean < 10
print(paste0("Barely expressed genes (mean < 10): ", sum(barely_expressed_genes)))[1] "Barely expressed genes (mean < 10): 26286"
## Apply filtering on both criteria
discarded_genes <- undetected_genes | barely_expressed_genes
print(paste0("Discarded genes: ", sum(discarded_genes)))[1] "Discarded genes: 26288"
kept_genes <- !discarded_genes
print(paste0("Kept genes: ", sum(kept_genes)))[1] "Kept genes: 20391"
## Genes after filtering
fa_expr_log2_filtered <- fa_expr_log2[kept_genes, ]## Generate a data frame where null values are replaced by NA
fa_expr_nonull <- fa_expr_log2_filtered
fa_expr_nonull[fa_expr_log2_filtered <= 0] <- NA
sum(is.na(fa_expr_nonull))[1] 5598
## Compute the 3rd quartile of non-null values for each sample and store them in a vector:
sample_q3_nonull <- apply(fa_expr_nonull, 2, quantile, prob = 0.75, na.rm = TRUE)
# print(sample_q3_nonull)
## Compute the Q3 for all the values, which will serve as target value for the standardised sample Q3
all_q3_nonull <- quantile(unlist(fa_expr_nonull), prob = 0.75, na.rm = TRUE)
# print(all_q3_nonull)
## Standardise expression on the third quartile of non-null values
## Beware : for this standardization we keep the null values
## Trick : we transpose the table to apply the ratio sample per sample,
## and then transpose the results to get back the genes in rows and samples in columns
fa_expr_log2_standard <- t(t(fa_expr_log2_filtered) * all_q3_nonull / sample_q3_nonull )
# quantile(unlist(fa_expr_log2_standard), probs = 0.75, na.rm = TRUE)
## We also compute the values for the "nonull" table for
## the sake of comparison and to check that the third quantiles of non-null
## values are well identical across samples.
fa_expr_log2_standard_nonull <- t(t(fa_expr_nonull) * all_q3_nonull / sample_q3_nonull )
# quantile(unlist(fa_expr_log2_standard_nonull), probs = 0.75, na.rm = TRUE)
## Compute Q3 before and after standardisation, including or not the null values
standardisation_impact <- data.frame(
before_all = apply(fa_expr_log2_filtered, 2, quantile, prob = 0.75, na.rm = TRUE),
before_nonull = apply(fa_expr_nonull, 2, quantile, prob = 0.75, na.rm = TRUE),
after_nonul = apply(fa_expr_log2_standard_nonull, 2, quantile, prob = 0.75, na.rm = TRUE),
after_all = apply(fa_expr_log2_standard, 2, quantile, prob = 0.75, na.rm = TRUE)
)
## Note: after standardization the Q3 of the data show some variations
## because we compute them here with the null values
kable(standardisation_impact, caption = "Impact of standardization on the third quantile (Q3) per sample. Third quantiles are computed before and after standardisation, with either all the values of the filtered table, or only the non-null values. ")## Data filtering: genes having at least 90% null values
message("Filtering undetected genes")
undetected_genes <- gene_stat_prenorm$null >= ncol(fa_expr_raw) * 0.9
print(paste0("Undetected genes (null in >= 90% samples): ", sum(undetected_genes)))[1] "Undetected genes (null in >= 90% samples): 14380"
## Data filtering: genes having a mean expression < 10
message("Filtering barely expressed genes")
barely_expressed_genes <- gene_stat_prenorm$mean < 10
print(paste0("Barely expressed genes (mean < 10): ", sum(barely_expressed_genes)))[1] "Barely expressed genes (mean < 10): 26286"
## Apply filtering on both criteria
discarded_genes <- undetected_genes | barely_expressed_genes
print(paste0("Discarded genes: ", sum(discarded_genes)))[1] "Discarded genes: 26288"
[1] "Kept genes: 20391"
## Generate a data frame where null values are replaced by NA
fa_expr_nonull <- fa_expr_log2_filtered
fa_expr_nonull[fa_expr_log2_filtered <= 0] <- NA
sum(is.na(fa_expr_nonull))[1] 5598
## Compute the 3rd quartile of non-null values for each sample and store them in a vector:
sample_q3_nonull <- apply(fa_expr_nonull, 2, quantile, prob = 0.75, na.rm = TRUE)
# print(sample_q3_nonull)
## Compute the Q3 for all the values, which will serve as target value for the standardised sample Q3
all_q3_nonull <- quantile(unlist(fa_expr_nonull), prob = 0.75, na.rm = TRUE)
# print(all_q3_nonull)
## Standardise expression on the third quartile of non-null values
## Beware : for this standardization we keep the null values
## Trick : we transpose the table to apply the ratio sample per sample,
## and then transpose the results to get back the genes in rows and samples in columns
fa_expr_log2_standard <- t(t(fa_expr_log2_filtered) * all_q3_nonull / sample_q3_nonull )
# quantile(unlist(fa_expr_log2_standard), probs = 0.75, na.rm = TRUE)
## We also compute the values for the "nonull" table for
## the sake of comparison and to check that the third quantiles of non-null
## values are well identical across samples.
fa_expr_log2_standard_nonull <- t(t(fa_expr_nonull) * all_q3_nonull / sample_q3_nonull )
# quantile(unlist(fa_expr_log2_standard_nonull), probs = 0.75, na.rm = TRUE)
## Compute Q3 before and after standardisation, including or not the null values
standardisation_impact <- data.frame(
before_all = apply(fa_expr_log2_filtered, 2, quantile, prob = 0.75, na.rm = TRUE),
before_nonull = apply(fa_expr_nonull, 2, quantile, prob = 0.75, na.rm = TRUE),
after_nonul = apply(fa_expr_log2_standard_nonull, 2, quantile, prob = 0.75, na.rm = TRUE),
after_all = apply(fa_expr_log2_standard, 2, quantile, prob = 0.75, na.rm = TRUE)
)
## Note: after standardization the Q3 of the data show some variations
## because we compute them here with the null values
kable(standardisation_impact, caption = "Impact of standardization on the third quantile (Q3) per sample. Third quantiles are computed before and after standardisation, with either all the values of the filtered table, or only the non-null values. ")| before_all | before_nonull | after_nonul | after_all | |
|---|---|---|---|---|
| normal_1 | 7.892971 | 7.929471 | 8.546124 | 8.506785 |
| normal_2 | 9.095259 | 9.120100 | 8.546124 | 8.522846 |
| normal_3 | 7.690496 | 7.728764 | 8.546124 | 8.503808 |
| day1_1 | 8.275875 | 8.310843 | 8.546124 | 8.510165 |
| day1_2 | 9.215737 | 9.236819 | 8.546124 | 8.526618 |
| day1_3 | 9.339495 | 9.356787 | 8.546124 | 8.530330 |
| day2_1 | 9.372222 | 9.391459 | 8.546124 | 8.528618 |
| day2_2 | 6.401122 | 6.563217 | 8.546124 | 8.335056 |
| day2_3 | 8.428494 | 8.462042 | 8.546124 | 8.512242 |
| day3_1 | 8.600914 | 8.618327 | 8.546124 | 8.528857 |
| day3_2 | 8.449052 | 8.476090 | 8.546124 | 8.518862 |
| day3_3 | 8.929730 | 8.945829 | 8.546124 | 8.530744 |
| day7_1 | 8.637420 | 8.652095 | 8.546124 | 8.531628 |
| day7_2 | 8.457495 | 8.478531 | 8.546124 | 8.524920 |
| day7_3 | 8.568555 | 8.585032 | 8.546124 | 8.529721 |
| day14_1 | 8.181662 | 8.194184 | 8.546124 | 8.533064 |
| day14_2 | 8.339087 | 8.364105 | 8.546124 | 8.520562 |
| day14_3 | 8.313689 | 8.334105 | 8.546124 | 8.525188 |
A vous de jouer!
Générez un data.frame avec une ligne par gène à partir du tableau de données normalisées, avec les statistiques suivantes (une statistique par colonne):
## Gene-wise statistics after normalisation
gene_stat_norm <- data.frame(
mean = apply(fa_expr_log2_standard, 1, mean, na.rm = F),
var = apply(fa_expr_log2_standard, 1, var, na.rm = TRUE),
sd = apply(fa_expr_log2_standard, 1, sd, na.rm = F),
CV = NA,
min = apply(fa_expr_log2_standard, 1, min, na.rm = TRUE),
Q1 = apply(fa_expr_log2_standard, 1, quantile, p = 0.25, na.rm = TRUE),
median = apply(fa_expr_log2_standard, 1, median, na.rm = TRUE),
Q3 = apply(fa_expr_log2_standard, 1, quantile, p = 0.75, na.rm = TRUE),
max = apply(fa_expr_log2_standard, 1, max, na.rm = TRUE),
null = apply(fa_expr_log2_standard == 0, 1, sum, na.rm = TRUE)
)
gene_stat_norm$CV <- gene_stat_norm$sd / gene_stat_norm$mean## Gene-wise statistics after normalisation
message("Computing gene-wise statistics on log2-transformed and normalised counts")
gene_stat_norm <- data.frame(
mean = apply(fa_expr_log2_standard, 1, mean, na.rm = F),
var = apply(fa_expr_log2_standard, 1, var, na.rm = TRUE),
sd = apply(fa_expr_log2_standard, 1, sd, na.rm = F),
CV = NA,
min = apply(fa_expr_log2_standard, 1, min, na.rm = TRUE),
Q1 = apply(fa_expr_log2_standard, 1, quantile, p = 0.25, na.rm = TRUE),
median = apply(fa_expr_log2_standard, 1, median, na.rm = TRUE),
Q3 = apply(fa_expr_log2_standard, 1, quantile, p = 0.75, na.rm = TRUE),
max = apply(fa_expr_log2_standard, 1, max, na.rm = TRUE),
null = apply(fa_expr_log2_standard == 0, 1, sum, na.rm = TRUE)
)
gene_stat_norm$CV <- gene_stat_norm$sd / gene_stat_norm$meanChaque gène étant donné par son identifiant dans la base de données ENSEMBL vous utiliserez le paquet biomaRt de bioconductor pour ajouter des annotations : symbole, chromosome, coordonnées génomiques, brin. Suivez pas à pas la méthode proposée:
sélectionnez la base de données ENSEMBL avec la fonction useMart(). Attention à choisir le bon génome avec l’agument dataset: mmusculus_gene_ensembl
avec la fonction getBM() récupérez de la base de données ENSEMBL les champs demandés (pour symbole utilisez external_gene_name) en appliquant “ensembl_geneid” pour l’agument filter et en indiquant pour l’argument values le vecteur des identifiants des gènes présents dans le dataframe gene_stat_norm. Vous obtenez un dataframe.
A présent, ajoutez au dataframe gene_stat_norm en 1ères colonnes les annotations retrouvées grâce à biomaRt. Attention, certains gènes ne sont pas retrouvés dans la version d’ENSEMBL sur biomaRt donc laissez des NA comme données manquantes dans ce cas. Nous vous recommandons d’utiliser la function merge() ou left_join() de dplyr pour fusionner les deux dataframes en un seul.
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org", dataset = "mmusculus_gene_ensembl")
genes <-getBM(attributes = c("ensembl_gene_id", "external_gene_name", "chromosome_name",
"start_position", "end_position", "strand"),
filter = "ensembl_gene_id",
values = row.names(gene_stat_norm),
mart = ensembl
)
gene_stat_norm <- merge(genes, gene_stat_norm, by.x = "ensembl_gene_id", by.y = 0, sort = FALSE)
kable(gene_stat_norm[100:109, ], caption = "Gene-wise statistics after normalisation")| ensembl_gene_id | external_gene_name | chromosome_name | start_position | end_position | strand | mean | var | sd | CV | min | Q1 | median | Q3 | max | null | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | ENSMUSG00000000724 | Cryba1 | 11 | 77609441 | 77616109 | -1 | 5.595006 | 0.1972841 | 0.4441668 | 0.0793863 | 4.816940 | 5.423046 | 5.632564 | 5.829410 | 6.426584 | 0 |
| 101 | ENSMUSG00000000731 | Aire | 10 | 77865856 | 77879444 | -1 | 4.936591 | 0.5868326 | 0.7660500 | 0.1551779 | 3.715427 | 4.333682 | 4.984400 | 5.510108 | 6.247104 | 0 |
| 102 | ENSMUSG00000000732 | Icosl | 10 | 77905136 | 77919747 | 1 | 5.543710 | 2.5859031 | 1.6080743 | 0.2900719 | 2.534509 | 4.617496 | 5.736471 | 6.856687 | 8.250127 | 0 |
| 103 | ENSMUSG00000000738 | Spg7 | 8 | 123789681 | 123824499 | 1 | 8.426883 | 0.2746445 | 0.5240653 | 0.0621897 | 7.477862 | 8.245098 | 8.330412 | 8.699996 | 9.484120 | 0 |
| 104 | ENSMUSG00000000740 | Rpl13 | 8 | 123829089 | 123831983 | 1 | 10.820838 | 0.1330747 | 0.3647941 | 0.0337122 | 10.156481 | 10.561094 | 10.850259 | 11.066164 | 11.622910 | 0 |
| 105 | ENSMUSG00000000743 | Chmp1a | 8 | 123931003 | 123939502 | -1 | 7.619693 | 1.1794074 | 1.0860052 | 0.1425261 | 6.110121 | 7.070298 | 7.398740 | 7.908298 | 10.866373 | 0 |
| 106 | ENSMUSG00000000751 | Rpa1 | 11 | 75188992 | 75239150 | -1 | 7.368976 | 0.5226199 | 0.7229245 | 0.0981038 | 5.984878 | 7.065193 | 7.241989 | 7.676944 | 9.010795 | 0 |
| 107 | ENSMUSG00000000753 | Serpinf1 | 11 | 75300595 | 75313527 | -1 | 2.830650 | 2.9689424 | 1.7230619 | 0.6087160 | 0.000000 | 2.595079 | 3.131414 | 3.959656 | 5.177214 | 4 |
| 108 | ENSMUSG00000000759 | Tubgcp3 | 8 | 12664277 | 12722248 | -1 | 8.617902 | 0.3027336 | 0.5502123 | 0.0638453 | 7.904374 | 8.308391 | 8.497898 | 8.791621 | 10.039692 | 0 |
| 109 | ENSMUSG00000000766 | Oprm1 | 10 | 6708506 | 6988198 | 1 | 5.356086 | 0.6398959 | 0.7999349 | 0.1493506 | 3.297268 | 4.832527 | 5.519335 | 5.837577 | 6.401143 | 0 |
Challenge falcultatif:
Enfin, réordonnez les gènes par position génomique et affichez les lignes 5 premières et 5 dernières lignes de ce tableau de statistiques.
<<<<<<< HEADgene_stat_norm <- gene_stat_norm[order(gene_stat_norm$chromosome_name, gene_stat_norm$start_position),]
kable(gene_stat_norm[c(1:5, (nrow(gene_stat_norm)-4):nrow(gene_stat_norm)), ], caption = "Gene-wise statistics after normalisation.")| ensembl_gene_id | external_gene_name | chromosome_name | start_position | end_position | strand | mean | var | sd | CV | min | Q1 | median | Q3 | max | null | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11991 | ENSMUSG00000051951 | Xkr4 | 1 | 3276124 | 3741721 | -1 | 4.779755 | 1.5566203 | 1.2476459 | 0.2610272 | 2.624881 | 3.8434294 | 4.769964 | 5.895264 | 6.801471 | 0 |
| 17939 | ENSMUSG00000103377 | Gm37180 | 1 | 3435954 | 3438772 | -1 | 2.371001 | 4.1080702 | 2.0268375 | 0.8548445 | 0.000000 | 0.2492876 | 1.809025 | 4.227908 | 5.706440 | 5 |
| 17870 | ENSMUSG00000103161 | Gm38148 | 1 | 3663115 | 3666126 | -1 | 6.987419 | 0.2946434 | 0.5428107 | 0.0776840 | 6.414353 | 6.7053245 | 6.979361 | 7.133402 | 8.854627 | 0 |
| 17551 | ENSMUSG00000102331 | Gm19938 | 1 | 3717532 | 3729127 | -1 | 4.711302 | 0.6605703 | 0.8127548 | 0.1725117 | 2.812792 | 4.5093438 | 4.750160 | 4.986285 | 6.770961 | 0 |
| 17647 | ENSMUSG00000102592 | Gm38385 | 1 | 3822233 | 3824583 | 1 | 5.854525 | 0.6717783 | 0.8196208 | 0.1399978 | 4.659168 | 5.2800025 | 5.756853 | 6.210068 | 8.211908 | 0 |
| 17896 | ENSMUSG00000103234 | Gm37158 | Y | 16691330 | 16697033 | -1 | 11.652663 | 0.1297206 | 0.3601674 | 0.0309086 | 11.198733 | 11.3190509 | 11.590467 | 11.964794 | 12.385978 | 0 |
| 18295 | ENSMUSG00000104381 | Gm37032 | Y | 40144240 | 40147214 | 1 | 2.826286 | 3.2938714 | 1.8149026 | 0.6421511 | 0.000000 | 2.3552767 | 2.888919 | 3.775050 | 7.017864 | 3 |
| 17372 | ENSMUSG00000101106 | Gm28819 | Y | 68997874 | 69008469 | 1 | 2.663406 | 2.7687782 | 1.6639646 | 0.6247506 | 0.000000 | 2.1148791 | 2.549968 | 3.457991 | 6.756537 | 2 |
| 16671 | ENSMUSG00000096768 | Gm47283 | Y | 90796007 | 90827734 | 1 | 8.509179 | 0.7688016 | 0.8768133 | 0.1030432 | 7.211906 | 7.7011795 | 8.531591 | 9.082058 | 10.206865 | 0 |
| 17213 | ENSMUSG00000099871 | Gm21742 | Y | 90848682 | 90855309 | 1 | 2.162429 | 3.6862082 | 1.9199500 | 0.8878674 | 0.000000 | 1.0430928 | 1.815921 | 3.214804 | 6.394224 | 4 |
hist(unlist(fa_expr_log2_standard),
breaks = seq(from = 0, to = max(fa_expr_log2_standard) + 1, by = 0.25),
xlab = "log2(counts) after standardisation",
ylab = "number of genes after filtering",
col = "#BBDDFF",
las = 1, cex.axis = 0.8,
main = "distribution after standardisation")
abline(v = mean(fa_expr_log2_standard), col = "darkgreen", lwd = 2)hist(unlist(fa_expr_log2_standard),
breaks = seq(from = 0, to = max(fa_expr_log2_standard) + 1, by = 0.25),
xlab = "log2(counts) after standardisation",
ylab = "number of genes after filtering",
col = "#BBDDFF",
las = 1, cex.axis = 0.8,
main = "distribution after standardisation")
abline(v = mean(fa_expr_log2_standard), col = "darkgreen", lwd = 2)Distribution of expression values (log2 counts) after gene filtering and standardisation on the sample-wise third-quartile of non-null values. The vertical line highlights the mean value.
#### Box plots to show normalisation impact ####
par(mar = c(4,6,4,1)) ## Set the margins
par(mfrow = c(2,2))
boxplot(fa_expr_log2_filtered,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Before standardisation\nall values")
boxplot(fa_expr_nonull,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Before standardisation\nzeros discarded")
boxplot(fa_expr_log2_standard_nonull,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Standardised\nzeros discarded")
boxplot(fa_expr_log2_standard,
xlab = "log2(counts)",
las = 1,
horizontal = TRUE,
col = fa_meta$color,
main = "Standardised\nall values")#### Box plots to show normalisation impact ####
par(mar = c(4,6,4,1)) ## Set the margins
par(mfrow = c(2,2))
boxplot(fa_expr_log2_filtered,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Before standardisation\nall values")
boxplot(fa_expr_nonull,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Before standardisation\nzeros discarded")
boxplot(fa_expr_log2_standard_nonull,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Standardised\nzeros discarded")
boxplot(fa_expr_log2_standard,
xlab = "log2(counts)",
las = 1,
horizontal = TRUE,
col = fa_meta$color,
main = "Standardised\nall values")Box plots showing the impact of normalisation
par(mfrow = c(1, 1))
par(mar = c(4,5,5,1))Pour réduire le nombre de gènes, nous allons écarter les gènes faiblement exprimés (log2 moyen inférieur à 4), et ne retenir que ceux qui montrent des variations importantes entre échantillons. Pour ce dernier critère, nous nous basons sur le coefficient de variation afin de relativiser la dispersion (écart type) par rapport à la tendance centrale (moyenne).
Sélectionnez les gènes ayant un niveau log2 moyen minimal supérieur à 3 (\(s > 3\)) et un coefficient de variation supérieur à 0.5 (\(CV > 0.5\)). Note: ces valeurs sont parfaitement arbitraires, elles ont été choisies pour obtenir un nombre raisonnable de gènes.
## Compute a Boolean vector indicating whether each gene passes or not the expression level threshold
high_expression <- gene_stat_norm$mean > 3
# table(high_expression) # count number of genes with high/weak expression
## Compute a Boolean vector indicating whether each gene passes or not the variation coefficient threshold
high_variation <- gene_stat_norm$CV > 0.5
# table(high_variation) # count number of genes with weak high coeffficient of variation
## Compute a Boolean vector indicating whether each gene passes or not the variance threshold
# high_variance <- gene_stat_norm$var > 2
# table(high_variance) # count number of genes with weak high variance
## Select genes having both a high mean expression and a high variation coefficien
selected_genes <- high_variation & high_expression
# table(selected_genes) # count number of genes with weak high coeffficient of variation
print(paste0("Selected genes: ", sum(selected_genes)))[1] "Selected genes: 473"
## Create a data frame with the expression of the selected genes
fa_expr_selected <- fa_expr_log2_standard[selected_genes, ]Dessinez des histogrammes des valeurs d’expression avant et après cette sélection de gènes, et commentez les différences.
#### Histograms of expression before and after gene selection ####
par(mfrow = c(2,1))
hist(unlist(fa_expr_log2_standard),
breaks = seq(from = 0, to = max(fa_expr_log2_standard) + 1, by = 0.25),
las = 1,
cex.axis = 0.8,
main = "Standardized values before gene selection",
col = "#DDBBFF")
hist(unlist(fa_expr_selected),
breaks = seq(from = 0, to = max(fa_expr_log2_standard) + 1, by = 0.25),
las = 1,
cex.axis = 0.8,
main = "Standardized values after gene selection",
col = "#FFDDBB")Pour réduire le nombre de gènes, nous allons écarter les gènes faiblement exprimés (log2 moyen inférieur à 4), et ne retenir que ceux qui montrent des variations importantes entre échantillons. Pour ce dernier critère, nous nous basons sur le coefficient de variation afin de relativiser la dispersion (écart type) par rapport à la tendance centrale (moyenne).
Sélectionnez les gènes ayant un niveau log2 moyen minimal supérieur à 5 (\(m > 5\)) et une variance supérieure à 2 (\(s^2 > 2\)). Note: ces valeurs sont parfaitement arbitraires, elles ont été choisies pour obtenir un nombre raisonnable de gènes.
## Compute a Boolean vector indicating whether each gene passes or not the expression level threshold
high_expression <- gene_stat_norm$mean > 5
# table(high_expression) # count number of genes with high/weak expression
## Compute a Boolean vector indicating whether each gene passes or not the variation coefficient threshold
high_variation <- gene_stat_norm$CV > 0.5
# table(high_variation) # count number of genes with weak high coeffficient of variation
## Compute a Boolean vector indicating whether each gene passes or not the variance threshold
high_variance <- gene_stat_norm$var > 2
# table(high_variance) # count number of genes with weak high variance
# ## Select genes having both a high mean expression and a high variation coefficien
# selected_genes <- high_variation & high_expression
## Select genes having both a high mean expression and a high variance
selected_genes <- high_variance & high_expression
# table(selected_genes) # count number of genes with weak high coeffficient of variation
print(paste0("Selected genes: ", sum(selected_genes)))[1] "Selected genes: 874"
## Create a data frame with the expression of the selected genes
fa_expr_selected <- fa_expr_log2_standard[selected_genes, ]Dessinez des histogrammes des valeurs d’expression avant et après cette sélection de gènes, et commentez les différences.
#### Histograms of expression before and after gene selection ####
par(mfrow = c(2,1))
hist(unlist(fa_expr_log2_standard),
breaks = seq(from = 0, to = max(fa_expr_log2_standard) + 1, by = 0.25),
las = 1,
cex.axis = 0.8,
main = "Standardized values before gene selection",
col = "#DDBBFF")
hist(unlist(fa_expr_selected),
breaks = seq(from = 0, to = max(fa_expr_log2_standard) + 1, by = 0.25),
las = 1,
cex.axis = 0.8,
main = "Standardized values after gene selection",
col = "#FFDDBB")Distribution of expression values before and after gene selection
par(mfrow = c(1,1))
# ## Some quick checks: the selection of highly variable genes select those having many zeros - and high values in other samples
# hist(unlist(fa_expr_log2_filtered[high_expression, ]), breaks=100)
# hist(unlist(fa_expr_log2_filtered[high_variation, ]), breaks=100)
# hist(unlist(fa_expr_log2_filtered[!high_variation, ]), breaks=100)
# hist(unlist(fa_expr_log2_filtered[selected_genes, ]), breaks=100)Dessinez un box plot par échantillon des valeurs d’expression avant et après sélection des gènes, et commentez le résultat.
#### Histogram of expression after gene selection ####
par(mfrow = c(1,2))
boxplot(fa_expr_log2_standard,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Before gene selection\nstandardised values")
boxplot(fa_expr_selected,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "After gene selection\nstandardised values")par(mfrow = c(1,1))
# ## Some quick checks: the selection of highly variable genes select those having many zeros - and high values in other samples
# hist(unlist(fa_expr_log2_filtered[high_expression, ]), breaks=100)
# hist(unlist(fa_expr_log2_filtered[high_variation, ]), breaks=100)
# hist(unlist(fa_expr_log2_filtered[!high_variation, ]), breaks=100)
# hist(unlist(fa_expr_log2_filtered[selected_genes, ]), breaks=100)Dessinez un box plot par échantillon des valeurs d’expression avant et après sélection des gènes, et commentez le résultat.
#### Histogram of expression after gene selection ####
par(mfrow = c(1,2))
boxplot(fa_expr_log2_standard,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "Before gene selection\nstandardised values")
boxplot(fa_expr_selected,
horizontal = TRUE,
xlab = "log2(counts)",
las = 1,
col = fa_meta$color,
main = "After gene selection\nstandardised values")Box plots of standardised expression values before and after gene selection.
par(mfrow = c(1,1))Dessinez un plot ACP des échantillons en les colorant par condition avant et après normalisation.
## Raw expression values, all genes
ma_pca_raw_tt <- PCA(t(fa_expr_raw),
scale.unit = FALSE,
graph = FALSE)
# plot(ma_pca_raw_tt, choix = "ind")
fviz_pca_ind(ma_pca_raw_tt, col.ind = fa_meta[, "color"])## Raw expression values, all genes
ma_pca_raw_tt <- PCA(t(fa_expr_raw),
scale.unit = FALSE,
graph = FALSE)
# plot(ma_pca_raw_tt, choix = "ind")
fviz_pca_ind(ma_pca_raw_tt, col.ind = fa_meta[, "color"])PC plot of the samples from the raw expression values of all genes.
ma_pca_filtered <- PCA(t(fa_expr_log2_filtered), scale.unit = FALSE,
graph = FALSE)
# plot(ma_pca_filtered, choix = "var")
# plot(ma_pca_sel, choix = "ind")
fviz_pca_ind(ma_pca_filtered, col.ind = fa_meta[, "color"])PC plot of the samples from standardised values after gene selection. =======
ma_pca_filtered <- PCA(t(fa_expr_log2_filtered), scale.unit = FALSE,
graph = FALSE)
# plot(ma_pca_filtered, choix = "var")
# plot(ma_pca_sel, choix = "ind")
fviz_pca_ind(ma_pca_filtered, col.ind = fa_meta[, "color"])PC plot of the samples from standardised values before gene selection. >>>>>>> b0d26ca57d19a113b96a754cdf1d9ad0a4b3d5be
ma_pca_sel <- PCA(t(fa_expr_selected), scale.unit = FALSE,
graph = FALSE)
# plot(ma_pca_sel, choix = "var")
# plot(ma_pca_sel, choix = "ind")
fviz_pca_ind(ma_pca_sel, col.ind = fa_meta[, "color"])ma_pca_sel <- PCA(t(fa_expr_selected), scale.unit = FALSE,
graph = FALSE)
# plot(ma_pca_sel, choix = "var")
# plot(ma_pca_sel, choix = "ind")
fviz_pca_ind(ma_pca_sel, col.ind = fa_meta[, "color"])PC plot of the samples from standardised values after gene selection.
dist()), coefficient de Pearson (cor(, method = "pearson")) et de Spearman (cor(, method = "spearman")).#### Sample distances ####
dist_euc_sel <- dist(t(fa_expr_selected))
cor_pearson_sel <- as.dist(1 - cor(fa_expr_selected))
cor_spearman_sel <- 1 - as.dist(cor(fa_expr_selected, method = "spearman"))ward.d2) pour l’agglomération. Comparez les arbres d’échantillons obtenus avec ces trois métriques et choisissez celle qui vous paraît la plus pertinente.#### Sample clustering ####
par(mfrow = c(1,3))
plot(hclust(dist_euc_sel), hang = -1,
main = "euclidean distance")
plot(hclust(cor_pearson_sel), hang = -1,
main = "pearson")
plot(hclust(cor_spearman_sel), hang = -1,
main = "spearman")#### Sample distances ####
dist_euc_sel <- dist(t(fa_expr_selected))
cor_pearson_sel <- as.dist(1 - cor(fa_expr_selected))
cor_spearman_sel <- 1 - as.dist(cor(fa_expr_selected, method = "spearman"))ward.d2) pour l’agglomération. Comparez les arbres d’échantillons obtenus avec ces trois métriques et choisissez celle qui vous paraît la plus pertinente.#### Sample clustering ####
par(mfrow = c(1,3))
plot(hclust(dist_euc_sel), hang = -1,
main = "euclidean distance")
plot(hclust(cor_pearson_sel), hang = -1,
main = "pearson")
plot(hclust(cor_spearman_sel), hang = -1,
main = "spearman")Sample tree with three alternative distance metrics: Euclidiant distance (left), Pearson correlation (center), Spearman correlation (right).?
par(mfrow = c(1,1))cor_pearson_gene <- as.dist(1 - cor(t(fa_expr_selected)))
plot(hclust(cor_pearson_gene), hang = -1,
main = "gènes")plot(hclust(cor_pearson_gene), hang = -1,
main = "gènes")
rect.hclust(hclust(cor_pearson_gene), k = 7)pheatmap(t(fa_expr_selected),
clustering_distance_cols = "correlation", clustering_distance_rows = "correlation")cor_pearson_gene <- as.dist(1 - cor(t(fa_expr_selected)))
plot(hclust(cor_pearson_gene), hang = -1,
main = "gènes")plot(hclust(cor_pearson_gene), hang = -1,
main = "gènes")
rect.hclust(hclust(cor_pearson_gene), k = 7)pheatmap(t(fa_expr_selected),
clustering_distance_cols = "correlation", clustering_distance_rows = "correlation")Interprétez les résultats en quelques phrases.
Effectuez une analyse d’enrichissement fonctionnel avec les principaux clusters obtenus dans la section précédente.
#### Run enrichment analysis with gost() ####
message("Running enrichment analysis with gost")
gene_list <- unlist(subset(x = gene_stat_norm,
subset = selected_genes,
select = external_gene_name))
# length(gene_list)
gostres <- gost(query = gene_list,
organism = "mmusculus",
ordered_query = FALSE,
multi_query = FALSE,
significant = TRUE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = "fdr",
domain_scope = "annotated",
custom_bg = NULL,
numeric_ns = "",
sources = NULL,
as_short_link = FALSE)
## Check the structure of the result
names(gostres)[1] "result" "meta"
[1] "query" "significant" "p_value" "term_size" "query_size" "intersection_size" "precision" "recall" "term_id" "source" "term_name" "effective_domain_size"
[13] "source_order" "parents"
| Length | Class | Mode | |
|---|---|---|---|
| result | 14 | data.frame | list |
| meta | 5 | -none- | list |
# hist(gostres$result$p_value, breaks=seq(from=0, to=1, by = 0.05))
## Check the most significant results, formating for kable
#head(gostres$result)
enrich_order <- order(gostres$result$p_value, decreasing = FALSE)
sorted_result <- gostres$result[enrich_order, ]
kable(head(sorted_result, n = 10),
digits = c(0, 0, 15, 7, 7, 7, 3, 3, 0, 0, 7, 7, 0, 0),
caption = "Top 10 most significant enriched functional classes")| query | significant | p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name | effective_domain_size | source_order | parents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | query_1 | TRUE | 0.000000008144159 | 92 | 340 | 21 | 0.062 | 0.228 | KEGG:04610 | KEGG | Complement and coagulation cascades | 8787 | 299 | KEGG:00000 |
| 2 | query_1 | TRUE | 0.000000042399671 | 272 | 409 | 35 | 0.086 | 0.129 | GO:0006955 | GO:BP | immune response | 11240 | 2901 | GO:0002376, GO:0050896 |
| 3 | query_1 | TRUE | 0.000000042399671 | 331 | 409 | 39 | 0.095 | 0.118 | GO:0002376 | GO:BP | immune system process | 11240 | 1038 | GO:0008150 |
| 25 | query_1 | TRUE | 0.000000169692312 | 627 | 255 | 53 | 0.208 | 0.085 | GO:0071944 | GO:CC | cell periphery | 7218 | 3147 | GO:0110165 |
| 26 | query_1 | TRUE | 0.000003143387113 | 559 | 255 | 46 | 0.180 | 0.082 | GO:0005886 | GO:CC | plasma membrane | 7218 | 522 | GO:0016020, GO:0071944 |
| 88 | query_1 | TRUE | 0.000005079383120 | 114 | 357 | 21 | 0.059 | 0.184 | REAC:R-MMU-381426 | REAC | Regulation of Insulin-like Growth Factor (IGF) transport and uptake by Insulin-like Growth Factor Binding Proteins (IGFBPs) | 8631 | 1220 | REAC:R-MMU-392499 |
| 89 | query_1 | TRUE | 0.000005349656278 | 108 | 357 | 20 | 0.056 | 0.185 | REAC:R-MMU-8957275 | REAC | Post-translational protein phosphorylation | 8631 | 1054 | REAC:R-MMU-597592 |
| 59 | query_1 | TRUE | 0.000012274925197 | 116 | 340 | 19 | 0.056 | 0.164 | KEGG:05150 | KEGG | Staphylococcus aureus infection | 8787 | 416 | KEGG:00000 |
| 27 | query_1 | TRUE | 0.000019834136256 | 10 | 255 | 6 | 0.024 | 0.600 | GO:0042613 | GO:CC | MHC class II protein complex | 7218 | 2067 | GO:0042611 |
| 90 | query_1 | TRUE | 0.000019912758023 | 258 | 357 | 31 | 0.087 | 0.120 | REAC:R-MMU-1474244 | REAC | Extracellular matrix organization | 8631 | 452 | REAC:0000000 |
# kable(head(gostres$result),
# format = c("%s", "%s", "%e", "%d", "%d", "%d", "%.3f", "%.3f", "%s", "%s", "%d", "%d", "%s", "%s"))
names(gostres$meta)[1] "query_metadata" "result_metadata" "genes_metadata" "timestamp" "version"
Résumez en quelques phrases vos conclusions à partir des résultats obtenus.